1 Introduction

In this file we will take the perspective of an investor to the lending club. Our goal is to select a subset of the most promising loans to invest in. We will do so using the method of logistic regression.

1.1 Load the data

lc_raw <- read_csv("LendingClub Data.csv",  skip=1) %>%  #since the first row is a title we want to skip it. 
  clean_names() # use janitor::clean_names()

2 ICE the data: Inspect, Clean, Explore

2.1 Inspect the data

glimpse(lc_raw)
## Rows: 42,538
## Columns: 80
## $ int_rate            <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.…
## $ loan_amnt           <dbl> 8000, 6000, 6500, 8000, 5500, 6000, 10200, 15000, …
## $ term_months         <dbl> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36…
## $ installment         <dbl> 241.28, 180.96, 196.04, 241.28, 165.88, 180.96, 30…
## $ dti                 <dbl> 2.11, 5.73, 17.68, 22.71, 5.75, 21.92, 18.62, 9.72…
## $ delinq_2yrs         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ annual_inc          <dbl> 50000, 52800, 35352, 79200, 240000, 89000, 110000,…
## $ grade               <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", …
## $ emp_title           <chr> NA, "coral graphics", NA, "Honeywell", "O T Plus, …
## $ emp_length          <chr> "5 years", "< 1 year", "n/a", "n/a", "10+ years", …
## $ home_ownership      <chr> "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE", "M…
## $ verification_status <chr> "Verified", "Source Verified", "Not Verified", "Ve…
## $ issue_d             <chr> "9/1/2011", "9/1/2011", "9/1/2011", "9/1/2011", "9…
## $ zip_code            <chr> "977xx", "228xx", "864xx", "322xx", "278xx", "760x…
## $ addr_state          <chr> "OR", "VA", "AZ", "FL", "NC", "TX", "CT", "WA", "N…
## $ loan_status         <chr> "Fully Paid", "Charged Off", "Fully Paid", "Fully …
## $ desc                <chr> "Borrower added on 09/08/11 > Consolidating debt f…
## $ purpose             <chr> "debt_consolidation", "vacation", "credit_card", "…
## $ title               <chr> "Credit Card Payoff $8K", "bad choice", "Credit Ca…
## $ x20                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x21                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x22                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x23                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x24                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x25                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x26                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x27                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x28                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x29                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x30                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x31                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x32                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x33                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x34                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x35                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x36                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x37                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x38                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x39                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x40                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x41                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x42                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x43                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x44                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x45                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x46                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x47                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x48                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x49                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x50                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x51                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x52                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x53                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x54                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x55                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x56                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x57                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x58                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x59                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x60                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x61                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x62                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x63                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x64                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x65                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x66                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x67                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x68                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x69                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x70                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x71                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x72                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x73                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x74                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x75                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x76                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x77                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x78                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x79                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ x80                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

2.2 Clean the data

There are no missing values or duplicates in the data set. Then we continue to clean the data.

The variable “loan_status” contains information as to whether the loan has been repaid or charged off (i.e., defaulted). Let’s create a binary factor variable for this. This variable will be our focus.

lc_clean<- lc_raw %>%
  dplyr::select(-x20:-x80) %>% #delete empty columns
  filter(!is.na(int_rate)) %>%   #delete empty rows
  mutate(
    issue_d = mdy(issue_d),  # lubridate::mdy() to fix date format
    term = factor(term_months),     # turn 'term' into a categorical variable
    delinq_2yrs = factor(delinq_2yrs) # turn 'delinq_2yrs' into a categorical variable
  ) %>% 
  mutate(default = dplyr::recode(loan_status, 
                      "Charged Off" = "1", 
                      "Fully Paid" = "0"))%>%
    mutate(default = as.factor(default)) %>%
  dplyr::select(-emp_title,-installment, -term_months, everything()) #move some not-so-important variables to the end. 
# for some of the caret commands it doesn't like it when default is equal to 0 or 1. So we change it to "D" for default and "P" for paid back.
lc_clean<-lc_clean%>%
  mutate(def=as.factor(ifelse(default=="1","D","P")))

2.3 Explore the data

Let’s explore loan defaults by creating different visualizations. We start with examining how prevalent defaults are, whether the default rate changes by loan grade or number of delinquencies, and a couple of scatter plots of defaults against loan amount and income.

#bar chart of defaults
def_vis1<-ggplot(data=lc_clean, aes(x=default)) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  labs(x="Default, 1=Yes, 0=No", y="relative frequencies") +
  scale_y_continuous(labels=scales::percent) +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),
                 y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5) 
def_vis1

#bar chart of defaults per loan grade
def_vis2<-ggplot(data=lc_clean, aes(x=default), group=grade) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)), stat="count")  +
  labs(title="Defaults by Grade", x="Default, 1=Yes, 0=No", y="relative frequencies") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~grade) + 
  theme(legend.position = "none") +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),y=(..count..)/sum(..count..) ),
            stat= "count",vjust=-0.5, size = 3) +
  NULL
def_vis2

#bar chart of defaults per number of Delinquencies
def_vis3<-lc_clean %>%
  filter(as.numeric(delinq_2yrs)<4) %>%
  ggplot(aes(x=default), group=delinq_2yrs) +
  geom_bar(aes(y = (..count..)/sum(..count..),
               fill = factor(..x..)), stat="count")  +
  labs(title="Defaults by Number of Delinquencies",
       x="Default, 1=Yes, 0=No", y="relative frequencies")  +
  scale_y_continuous(labels=scales::percent) +facet_grid(~delinq_2yrs) + 
  theme(legend.position = "none") +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5)+
  NULL

def_vis3

#scatter plots 

#We select 2000 random loans to display only to make the display less busy. 
set.seed(1234)
reduced<-lc_clean[sample(0:nrow(lc_clean), 2000, replace = FALSE),]%>%
  mutate(default=as.numeric(default)-1) # also convert default to a numeric {0,1} to make it easier to plot.

          
# scatter plot of defaults against loan amount                         
def_vis4<-ggplot(data=reduced, aes(y=default,x=I(loan_amnt/1000)))  +
  labs(y="Default, 1=Yes, 0=No", x="Loan Amnt (1000 $)") +
  geom_jitter(width=0, height=0.05, alpha=0.7) + #We use jitter to offset the display of defaults/non-defaults to make the data easier to interpert. We have also changed the amount to 1000$ to reduce the number of zeros on the horizontal axis.
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"))+
  NULL

def_vis4

#scatter plot of defaults against loan amount.
def_vis5<-ggplot(data=reduced, aes(y=default,x=I(annual_inc/1000))) + 
  labs(y="Default, 1=Yes, 0=No", x="Annual Income(1000 $)") +
  geom_jitter(width=0, height=0.05, alpha=0.7) + 
  xlim(0,400)+
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"))+
  NULL

def_vis5

From the visualizations above, we can observe: - 86% non default loans and 14% defaults in the data set. - Borrowers of grade A have a less probability of defaults. Generally speaking, the lower the grade is, the higher the chance of default is. - People with a higher number of delinquencies are also more likely to default. - Default probability is positively related to loan amount, and negatively related to annual income.

We can also estimate a correlation table between defaults and other continuous variables.

# correlation table using GGally::ggcor()
# this takes a while to plot

lc_clean %>% 
  mutate(default=as.numeric(default)-1)%>%
  select(loan_amnt, dti, annual_inc, default) %>% #keep Y variable last
  ggcorr(method = c("pairwise", "pearson"), label_round=2, label = TRUE)+
  NULL

The correlation coefficients shows dti is positively related to default probability. In addition, loan amount is positively related to annual income. People who earn more tend to lend more.

Section 1: Visulazations

# term months
def_vis6 <- lc_clean %>%
  ggplot(aes(x=default), group=term) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)), stat="count")  + 
  labs(title="Defaults by Term Months", x="Default, 1=Yes, 0=No", y="relative frequencies")  +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~term) + theme(legend.position = "none") +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),
                 y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5)+
  NULL
def_vis6

# home ownership
def_vis7 <- lc_clean %>%
  ggplot(aes(x=default), group=home_ownership) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)), stat="count")  + 
  labs(title="Defaults by Home Ownership", x="Default, 1=Yes, 0=No", y="relative frequencies")  +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~home_ownership) + theme(legend.position = "none") +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),
                 y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5)+
  NULL
def_vis7

# employment length
def_vis8 <- lc_clean %>%
  ggplot(aes(x=default), group=emp_length) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)), stat="count")  + 
  labs(title="Defaults by Employment Length", x="Default, 1=Yes, 0=No", y="relative frequencies")  +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~emp_length) +
  theme(legend.position = "none") +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),
                 y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5,size = 3)+
  NULL
def_vis8

# default rate per employment length
lc_clean %>% 
  group_by(default,emp_length) %>% 
  summarise(count = n()) %>% 
  pivot_wider(names_from = default,
              values_from = count) %>%
  clean_names() %>% 
  mutate(default_rate = x1/(x0+x1)) %>% 
  ggplot(aes(y=default_rate,x=emp_length))+
  geom_bar(stat = "identity")

# Verification Status
def_vis9 <- lc_clean %>%
  ggplot(aes(x=default), group=verification_status) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)), stat="count")  + 
  labs(title="Defaults by Verification Status", x="Default, 1=Yes, 0=No", y="relative frequencies")  +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~verification_status) + theme(legend.position = "none") +
  geom_text(aes( label = scales::percent(round((..count..)/sum(..count..),2) ),
                 y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5)+
  NULL
def_vis9

# 
lc_clean %>% 
  ggplot(aes(x=default,y=(..count..)/sum(..count..),fill=default))+
  geom_bar()+
  # 2x2 grid faceting
  facet_grid(grade ~ verification_status) +
  # x axis in %
  scale_y_continuous(label=scales::percent_format(accuracy = 1))+
  theme_bw()+
  labs(title = "Defaults by Verification Status and Term Months",
       x="",
       y="")+
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),y=(..count..)/sum(..count..) ),
            stat= "count",vjust=-0.5, size = 3) +
  ylim(0,0.2)+
  NULL

# Interest rate over time
ggplot(lc_clean, aes (y=int_rate, x=issue_d, color = grade))+
  geom_point()+
  theme_bw()+
  labs(title="Interest Rate Evolution Over Time by Grade",
       x="Issue Date", y="Interest Rate")+
  NULL

  • 36-month loans have a lower proportion of defaults.
  • Borrowers with mortgages are on average less likely to default than those who are renting houses or own houses.
  • People with 10+ years employment length take up the highest proportion in the sample, while the default rates for all groups are similar except for the n/a group.
  • Not verified ones are less likely to default, which is against common sense. Drill down and we can find the reason: Most grade A borrowers are not verified because the grade is already a reliable proof. Therefore, we can expect a higher percentage of grade A borrowers in all Not Verified ones, which explains the anti common sense observation. We need to be careful with this feature in the models later.
  • We can observe the change rate of interest rates with time are different among loans of different grades. Loans of lower grade seem more profitable and the interest rate is increasing, but that only happens when the loans do not default. We should always evaluate both risk and return at the same time to achieve profit maximization.

3 Linear vs. logistic regression for binary response variables

It is certainly possible to use the OLS approach to find the line that minimizes the sum of square errors when the dependent variable is binary (i.e., default no default). In this case, the predicted values take the interpretation of a probability. We can also estimate a logistic regression instead. We do both below.

model_lm <- lm(as.numeric(default)~I(annual_inc/1000), lc_clean)
summary(model_lm)
## 
## Call:
## lm(formula = as.numeric(default) ~ I(annual_inc/1000), data = lc_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1595 -0.1494 -0.1444 -0.1332  1.3268 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.161e+00  2.703e-03 429.312   <2e-16 ***
## I(annual_inc/1000) -2.479e-04  2.918e-05  -8.496   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3501 on 37867 degrees of freedom
## Multiple R-squared:  0.001903,   Adjusted R-squared:  0.001876 
## F-statistic: 72.19 on 1 and 37867 DF,  p-value: < 2.2e-16
logistic1<-glm(default~I(annual_inc/1000), family="binomial", lc_clean)
summary(logistic1)
## 
## Call:
## glm(formula = default ~ I(annual_inc/1000), family = "binomial", 
##     data = lc_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6246  -0.5786  -0.5572  -0.5115   3.6388  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.5191545  0.0292430  -51.95   <2e-16 ***
## I(annual_inc/1000) -0.0040801  0.0003998  -10.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31130  on 37868  degrees of freedom
## Residual deviance: 31005  on 37867  degrees of freedom
## AIC: 31009
## 
## Number of Fisher Scoring iterations: 5
ggplot(data=reduced, aes(x=I(annual_inc/1000), y=default)) +
  geom_smooth(method="lm", se=0, aes(color="OLS"))+
  geom_smooth(method = "glm", method.args = list(family = "binomial"),  se=0, aes(color="Logistic"))+ 
  labs(y="Prob of Default", x="Annual Income(1000 $)")+ 
  xlim(0,450)+
  scale_y_continuous(labels=scales::percent)+
  geom_jitter(width=0, height=0.05, alpha=0.7) +
  scale_colour_manual(name="Fitted Model", values=c("blue", "red"))+
  NULL

Section 2: Comparison of Linear and Logistic Regression

Since the probability cannot be negative, I believe logistic regression model makes more sense to predict the default probability.

4 Multivariate logistic regression

We can estimate logistic regression with multiple explanatory variables as well. Let’s use annual_inc, term, grade, and loan amount as features. Let’s call this model logistic 2.

logistic2<- glm(default~I(annual_inc/1000)+term+grade+loan_amnt, family="binomial", lc_clean)
summary(logistic2)
## 
## Call:
## glm(formula = default ~ I(annual_inc/1000) + term + grade + loan_amnt, 
##     family = "binomial", data = lc_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0635  -0.6079  -0.4815  -0.3455   4.0988  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.432e+00  5.056e-02 -48.093   <2e-16 ***
## I(annual_inc/1000) -6.019e-03  4.727e-04 -12.733   <2e-16 ***
## term60              4.790e-01  3.561e-02  13.451   <2e-16 ***
## gradeB              6.599e-01  5.270e-02  12.521   <2e-16 ***
## gradeC              1.031e+00  5.392e-02  19.128   <2e-16 ***
## gradeD              1.288e+00  5.703e-02  22.578   <2e-16 ***
## gradeE              1.415e+00  6.662e-02  21.241   <2e-16 ***
## gradeF              1.666e+00  8.630e-02  19.302   <2e-16 ***
## gradeG              1.769e+00  1.340e-01  13.198   <2e-16 ***
## loan_amnt           2.841e-06  2.347e-06   1.211    0.226    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31130  on 37868  degrees of freedom
## Residual deviance: 29286  on 37859  degrees of freedom
## AIC: 29306
## 
## Number of Fisher Scoring iterations: 5
#compare the fit of logistic 1 and logistic 2
anova(logistic1,logistic2)
Resid. DfResid. DevDfDeviance
3.79e+043.1e+04        
3.79e+042.93e+0481.72e+03

Section 3: Logistic Regression Model Explanation a. Estimated Coefficient b. Standard error of coefficient c. p-value of coefficient d. Deviance e. AIC f. Null Deviance g. Is Logistic 2 a better model than logistic 1? Why or why not?

  1. The formula of logistic regression is \(ln(\frac{1-p}{p})=\beta_0+\beta_1*x_1+...\) Therefore, according to the elementary calculus, coefficients are no more marginal value (odds ratio).
    1. With $1,000 increase of annual income, the probability of default will averagely decrease by 0.6%*p*(1-p).
    2. On average, loans of term 60 months have 47.9%*p*(1-p) higher probability of default than those of 36 months.
    3. For grade B to G, their probability of default will be respectively 65.99%*p*(1-p), 103.1%*p*(1-p), 128.8%*p*(1-p), 141.5%*p*(1-p), 166.6%*p*(1-p) and 176.9%*p*(1-p) higher than grade A loans on average.
    4. With $1,000 increase of loan amount, the probability of default will averagely increase by 2.841%*p*(1-p), but this is effect not significant enough.
  2. For coefficient of annual income (divided by 1000), the standard error is 0.00047. For coefficient of loan term (60-month vs 36-month), the standard error is 0.0356. For coefficients of each grade (B to F vs A), the standard errors are 0.0527, 0,0539, 0.0570, 0.0666, 0.0863, 0.0134. (Based on the assumption that the coefficients are normally distributed,) the standard errors above are relatively small enough to have >95% confidence to say the coefficients are different from 0) For coefficients of loan amount, the standard error is 2.347e-06. Compared to coefficient 2.841e-06, this number is too large to say the effect of loan amount is significantly different from 0.
  3. p-values of all coefficients except for loan amount are very close to 0, from which we can draw the same conclusion as above. p-value for coefficient of loan amount is 0.226. That is to say, we have 22.6% probability of falsely rejecting (Type I error) the null hypothesis that the coefficient is different from 0.
  4. Deviance = -2log(L) = 29286, which we want to minimize (if let aside penalties).
  5. AIC = -2log(L) + 2k = 29306, which penalize on the number of estimated coefficients (larger than Deviance).
  6. Null Deviance = 31130, which is the deviance of a model with only the intercept. This value can be regarded as the baseline.
  7. comparison

When compare the result of model 2 with model 1 using anova function, we can see after introducing 8 more coefficients (3 additional variables), the deviance increases by 1719. To dive deeper in the results, we can see AIC also increases from 31009 to 29306 by 1703. It seems a great increase on the goodness-of-fit of the model. However, there is a seemingly redundant variable in model 2 which is insignificant. When talking about the explanation of the model, this could undermine model 2. Generally speaking, there are improvements of model 2.

Section 4: Predictions and density chart

#Predict the probability of default
prob_default2 <- predict(logistic2, lc_clean, type= "response")

#plot 1: Density of predictions
g0 <- ggplot( lc_clean, aes( prob_default2))+
  geom_density(size=1)+
  ggtitle("Predicted Probability with Logistic 2")+  
  xlab("Estimated Probability")+
  NULL
g0

#plot 2: Density of predictions by default
g1 <- ggplot(lc_clean, aes(prob_default2, color=default)) +
  geom_density( size=1)+
  ggtitle("Predicted Probability with Logistic 2")+
  xlab("Estimated Probability")+
  NULL
g1

4.1 From probability to classification

The logistic regression model gives us a sense of how likely defaults are; it gives us a probability estimate. To convert this into a prediction, we need to choose a cutoff probability and classify every loan with a predicted probability of default above the cutoff as a prediction of default (and as a prediction of non-default for loans with a predicted probability below this cutoff).

Let’s choose a threshold. Of course some of our predictions will turn out to be right but some will turn out to be wrong – you can see this in the density figures of the previous section. Let’s call “default” the “positive” class since this is the class we are trying to predict. We could be making two types of mistakes. False positives (i.e., predict that a loan will default when it will not) and false negatives (I.e., predict that a loan will not default when it will). These errors are summarized in the confusion matrix.

Section 5: Confusion matrix for the model logistic 2 for a cutoff of 16%

#Call any loan with probability more than 16% as default and any loan with lower probability as non-default. Make sure your prediction is a factor with the same levels as the default variable in the lc_clean data frame
p_class <- factor(ifelse(prob_default2>0.16,"1","0")) #If the the probability is great than the threshold of 0.16 then output 1 otherwise 0

#produce the confusion matrix and set default as the positive outcome
con2<-confusionMatrix(p_class,lc_clean$default,positive="1") #the first input is the class prediction, the second input is the actual outcomes. We also define the positive outcome to be "1" (i.e., default is the outcome we consider "positive"). The output is a confusion matrix.

#print the confusion matrix
con2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 21603  2204
##          1 10837  3225
##                                           
##                Accuracy : 0.6556          
##                  95% CI : (0.6508, 0.6604)
##     No Information Rate : 0.8566          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1564          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.59403         
##             Specificity : 0.66594         
##          Pos Pred Value : 0.22934         
##          Neg Pred Value : 0.90742         
##              Prevalence : 0.14336         
##          Detection Rate : 0.08516         
##    Detection Prevalence : 0.37133         
##       Balanced Accuracy : 0.62998         
##                                           
##        'Positive' Class : 1               
## 

Secion 6. Explainatino of Confusion Matrix a. Accuracy b. Sensitivity c. Specificity For each of these explain what they mean in the context of the lending club and the goal of predicting loan defaults.

  1. Accuracy = (True Positive + False Negative) / (True Positive + True Negative + False Positive + False Negative) = 0.6556 This metric measures the proportion of correct predictions. In this case it means 65.56% of all the loans in the sample are correctly predicted, but we don’t know how many are defaults or how many are not.
  2. Sensitivity = TP rate = TP / (TP + FN) = 3225 / (3225 + 2204) = 0.59403 This metric measures of all positive samples (TP + FN) how much the model predicts correctly. In this case it means 59.403% of defaults are captured by the model (~40.6% can be falsely invested in and lose money).
  3. Specificity = TN rate = TN / (TN + FP) = 21603 / (21603 + 10837) = 0.66594 This metric measures of all negative samples (TN + FP) how much the model predicts correctly. In this case it means 66.59% of the non-defaults are captured, and we may miss the opportunities of earning more on the rest loans (~33.4%).

Goal: the goal is to prevent investors from giving money to those who will default, which means we may lose most of the investment. As a trade-off, we can sacrifices on the potential opportunities and miss some non-defaults with high returns, but we must make sure most of our non-defaults decisions are correct. Thus, I think we should aim at maximizing the TN / (TN+FN) (aka. negative pred value in the table) to make sure what we want to invest in have a higher change not to default. Higher Sensitivity also helps to prevent investors from investing in potential defaults.

Section 7: ROC curve and AUC measure.

The codes below will produce the ROC curve with AUC measure.

#estimate the ROC curve for Logistic 2
ROC_logistic2 <-roc(lc_clean$default, prob_default2) # the first argument is a vector of outcomes and the second is a vector of probabilities associated with each outcome

#estimate the AUC for Logistic 2 and round it to two decimal places
AUC2 <- round(auc(lc_clean$default,prob_default2)*100, digits=2)
#Plot the ROC curve and display the AUC in the title
ROC2 <- ggroc(ROC_logistic2,  alpha = 0.5)+ 
  ggtitle(paste("Model Logistic 2: AUC=",AUC2,"%"))  +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")+
  geom_segment(aes(x = 1, xend = 1, y = 0, yend = 1), color="black", linetype="dashed")+
  geom_segment(aes(x = 1, xend = 0, y = 1, yend = 1), color="black", linetype="dashed")

ROC2

ROC curve is produced when we walk through all the possible cut-offs and get the corresponding measures (Sensitivity and Specificity). Apparently, when the cut-off of probability moves from 1 (All Negative prediction) to 0 (All Positive prediction), the specificity should move from 1 to 0 and the sensitivity should move from 0 to 1. AUC means Area under the curve, indicating how much the model is better than flipping a coin (50% positive and 50% negative randomly).

The AUC cannot exceed 1 because the specificity and sensitivity cannot exceed 1 by definition (denominator larger than numerator). It cannot be lower than 0.5 because 0.5 is the baseline value of a random model. Since we have improved the probability prediction model from a single intercept model, the data points with higher probability should contain more defaults records, which is supposed to be better than randomly sorted points.

Section 8: Out-of-sample ROC curve

# splitting the data into training and testing
set.seed(1234)
train_test_split <- initial_split(lc_clean, prop = 0.7)
testing <- testing(train_test_split) #20% of the data is set aside for testing
training <- training(train_test_split) #80% of the data is set aside for training

# run logistic 2 on the training set 
logistic2<- glm(default~I(annual_inc/1000)+term+grade+loan_amnt, family="binomial", training)

#calculate probability of default in the training sample 
p_in<- predict(logistic2, training, type= "response")

#ROC curve using in-sample predictions
ROC_logistic2_in <- roc(training$default,p_in) 
#AUC using in-sample predictions
AUC_logistic2_in <- round(auc(training$default,p_in)*100, digits=2)
  
#calculate probability of default out of sample 
p_out <- predict(logistic2, testing, type= "response")

#ROC curve using out-of-sample predictions
ROC_logistic2_out <- roc(testing$default,p_out) 
#AUC using out-of-sample predictions
AUC_logistic2_out <- round(auc(testing$default,p_out)*100, digits=2)

#plot in the same figure both ROC curves and print the AUC of both curves in the title
ggroc(list(ROC_logistic2_in,ROC_logistic2_out))+
  ggtitle(paste("Model Logistic 2: AUC_in=",AUC_logistic2_in,"%, ","AUC_out=",AUC_logistic2_out,"%"))  +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")+
  geom_segment(aes(x = 1, xend = 1, y = 0, yend = 1), color="black", linetype="dashed")+
  geom_segment(aes(x = 1, xend = 0, y = 1, yend = 1), color="black", linetype="dashed")+
  scale_color_manual(name="Dataset",
                     labels=c("training","testing"),
                     values=c("red","blue"))+
  NULL

68.07% in training set against 67.81% in testing set. The difference is not much and the ROC curves are quite close. Thus, I believe overfitting is not a problem in this model.

4.2 Selecting loans to invest in using the model Logistic 2.

Before we look for a better model than logistic 2 let’s see how we can use this model to select loans to invest in. Let’s make the simplistic assumption that every loan generates $25 profit if it is paid off and $90 loss if it is charged off for an investor. Let’s use a cut-off value to determine which loans to invest in, that is, if the predicted probability of default for a loan is below this value then we invest in that loan and not if it is above.

To do this we split the data in three parts: training, validation, and testing. Feel free to experiment with different seeds but please use the seeds provided below for your submission.

# splitting the data into training and testing
set.seed(1234)
train_test_split <- initial_split(lc_clean, prop = 0.6)
training <- training(train_test_split) #60% of the data is set aside for training
remaining <- testing(train_test_split) #40% of the data is set aside for validation & testing
set.seed(4321)
train_test_split <- initial_split(remaining, prop = 0.5)
validation<-training(train_test_split) #50% of the remaining data (20% of total data) will be used for validation
testing<-testing(train_test_split) #50% of the remaining data (20% of total data) will be used for testing

Section 9: Determin optimal cutoff threshold

Train logistic 2 on the training set above. Use the trained model to determine the optimal cut-off threshold based on the validation test. What is the optimal cutoff threshold? How much profit does it generate? Using the testing set, what is the profit per loan associated with the cutoff?

# estimate the model on the training set
logistic2<-glm(default~I(annual_inc/1000) + term + grade + loan_amnt, family="binomial", training)

p_val<-predict(logistic2, validation, type = "response") #predict probability of default on the validation set

# select the cutoff threshold using the estimated model and the validation set
profit=0
threshold=0
for (i in 1:500) {
  threshold[i]=i/500
  one_or_zero_search<-ifelse(p_val>threshold[i],"1","0")
  p_class_search<-factor(one_or_zero_search,levels=levels(validation$default))

  con_search<-confusionMatrix(p_class_search,validation$default,positive="1")
  profit[i]=con_search$table[1,1]*25-con_search$table[1,2]*90
}

ggplot(as.data.frame(threshold), aes(x=threshold,y=profit)) +
  geom_smooth(method = 'loess', se=0) +
  labs(title="Profit curve with logistic 2 based on validation set")+
  NULL

# output the maximum profit and the associated threshold
paste0("Based on the validation set: Maximum profit per loan is $", 
       round(max(profit)/nrow(validation),2), " achieved at a threshold of ",
       threshold[which.is.max(profit)]*100,"%.")
## [1] "Based on the validation set: Maximum profit per loan is $10.33 achieved at a threshold of 23%."
# optimal threshold based on the validation set
threshold = threshold[which.is.max(profit)]

# Use the model estimated on the training set to predict probabilities of default on the testing set
p_test <- predict(logistic2, testing, type = "response")

#use the threshold estimated using the validation set to estimate the profits on the testing set
one_or_zero <- ifelse(p_test>threshold, "1", "0")
p_class <- factor(one_or_zero,levels=levels(testing$default))
con <- confusionMatrix(p_class,testing$default,positive="1")
profit = con$table[1,1]*25 - con$table[1,2]*90
paste0("Based on the testing set the actual profit per loan is: $", round(profit/nrow(testing),2))
## [1] "Based on the testing set the actual profit per loan is: $9.68"

On validation set, the optimal cutoff threshold is 23%, which will generate profit of $9.68 on testing set.

5 More realistic revenue model

Let’s build a more realistic profit and loss model. Each loan has different terms (e.g., different interest rate and different duration) and therefore a different return if fully paid. For example, a 36 month loan of $5000 with installment of $163 per month would generate a return of 163*36/5000-1 if there was no default. Let’s assume that it would generate a loss of -70% if there was a default (the loss is not 100% because the loan may not default immediately and/or the lending club may be able to recover part of the loan).

Section 10: Invest $1 in each loan

Under these assumptions, how much return would you get if you invested $1 in each loan in the validation set? Express your answer as a % return.

validation = validation %>% 
  mutate(ret= installment * term_months/loan_amnt - 1,
         actual_return = ifelse(default==1, -0.7, ret))
paste0("The return will be: ",
  scales::percent(sum(validation$actual_return)/nrow(validation),
                  accuracy = 0.01))
## [1] "The return will be: 10.22%"

Unfortunately, we cannot use the realized return to select loans to invest in (as at the time we make the investment decision we do not know which loan will default). Instead, we can calculate an expected return using the estimated probabilities of default – expected return = return if not default * (1-prob(default)) + return if default * prob(default).

Section 11: Loans Selection

In the following questions, we will still equally invest in each of the loan selected by the model, which is also the recommended way on Lending Club website to mitigate risks. By investing in this way, we will not suffer from losing huge amount of money because some huge amount loans default at the end. Suppose we still invest $1 in each although the actual minimum requirement is $25 for Lending Club platform. However, the rates of return will be the same.

Here we calculate the expected return of the loans in the validation set using logistic 2 model trained in the training set to select a portfolio of the \(n\) most promising loans to invest in (\(n\) is an integer number). First, suppose \(n=800\).

validation$prob_default = p_val
validation = validation %>% 
  mutate(expected_return = -0.7 * prob_default + ret * (1 - prob_default))

# specify the number of loans
n=800
# still assume equal weights of the loans in the portfolio
r = validation %>% 
  slice_max(order_by = expected_return, n=n) %>% 
  select(actual_return) %>% 
  sum()/n

paste0("The rate of return will be: ",
       scales::percent(r, accuracy = 0.01))
## [1] "The rate of return will be: 19.36%"
paste0("The profit will be: $",
       round(r*n,2))
## [1] "The profit will be: $154.87"

The return is 19.36% when invest $1 equally in each of the top 800 promising loans (with highest expected return) in validation set. The profit is $154.87.

ror=0
profit=0
for (n in 1:1000){
  r = validation %>% 
  slice_max(order_by = expected_return, n=n) %>% 
  select(actual_return) %>% 
  sum()/n
  ror[n] = r
  profit[n] = n*r
}
ggplot(as.data.frame(1:1000)) +
  geom_line(aes(x=1:1000,y=ror))+
  geom_line(aes(x=1:1000,y=profit/200), linetype = "dashed")+
  labs(title="Return and Profit curve with logistic 2 based on validation set (Expected Return)")+
  xlab("Top n promising loans")+
  scale_y_continuous(name="Rate of Return",
                     labels = scales::percent,
                     sec.axis = sec_axis(trans=~.*200, name="Profit"))+
  NULL

We can observe as the number of loans we choose to invest increases, the rate of return becomes lower and less fluctuate. (The profit increases because the investment increases as well). I think it’s good to avoid risks by investing the same amount of money into more loans.

Section 12: Sensitivity to loss percentage of default

The following codes will show how the return will change if the loss moves between 20% and 80%, for \(n=800\).

loss_exp = seq(0.2,0.8,0.01)
i=1
n=800
ror=0
profit=0
for (l in loss_exp){
  validation = validation %>% 
    mutate(expected_return = -l * prob_default + ret * (1 - prob_default))
  r = validation %>% 
    slice_max(order_by = expected_return, n=n) %>% 
    select(actual_return) %>% 
  sum()/n
  ror[i] = r
  profit[i] = n*r
  i = i + 1
}
ggplot(as.data.frame(loss_exp)) +
  geom_line(aes(x=loss_exp,y=ror))+
  geom_line(aes(x=loss_exp,y=profit/1000), linetype = "dashed")+
  labs(title="Return and Profit curve with logistic 2 based on validation set (Expected Return)")+
  xlab("Top n promising loans")+
  scale_x_continuous(name="Loss of default",
                     labels = scales::percent)+
  scale_y_continuous(name="Rate of Return",
                     labels = scales::percent,
                     sec.axis = sec_axis(trans=~.*1000, name="Profit"))+
  NULL

The return moves form 18.04% to 19.68% and the profit moves between $144.3664 and $157.4051. The model is not very sensitive to the loss percentage of default. When the percentage increases, the model will adjust the expected return and choose more secure loans to maintain the return.

Section 13: Experiment with different models using more features, interactions, non-linear transformations and regularizations.

set.seed(121)
#LASSO logistic regression -- choosing the penalty lambda based on out-of-sample ROC metric
trainControl <- trainControl(
  method = "cv",
  number = 10,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = FALSE
)
model_lasso <- train(def~ term*grade*loan_amnt*poly(annual_inc,5)+emp_length+verification_status,
            data = training, 
            method = "glmnet", 
            trControl = trainControl,
            metric = "ROC", # Optimize by AUC of ROC
            family="binomial",
            preProc = c("center", "scale"), #This option standardizes the data before running the LASSO regression
             tuneGrid=expand.grid(
              .alpha=1,
              .lambda=seq(0, 0.1, length = 101))) # search for the best lambda between 0 and 0.1 in increments of 0.001 (for a total of 101 steps).
#note that we are estimating 1,010 different models (101 values of lambda in each of 10 folds) so this might take a little time to estimate

#Compute the out of sample ROC curve for the LASSO model of best fit. Note that when we use caret to fit logistic regression, we need to use type="prob" to predict probabilities instead of type="response" when we use the glm package. (Ah, the subtleties of a free open-source software).
pred<-predict(model_lasso, validation, type = "prob")
x<-ifelse(validation$def=="D",1,0)
ROC_lasso <- roc(x,pred[,1])

#Compare it with logistic regression (this is equivalent to lasso with lambda 0)
logistic<-glm(def~ term*grade*loan_amnt*poly(annual_inc,5), family="binomial", training)
p_out<-predict(logistic, validation, type = "response")
ROC_logistic_out <- roc(validation$default,p_out) 

prob_default2 <- predict(logistic2, validation, type = "response")
ROC_logistic2 <- roc(validation$default, prob_default2) 


#Plot the ROC curves
ggroc(list("LASSO Logistic out-of-sample"=ROC_lasso, "Logistic out-of-sample"=ROC_logistic_out,
           "Logistic 2 out-of-sample"=ROC_logistic2))+
  ggtitle(paste("Model LASSO Logistic out-of-sample AUC=",round(auc(x,pred[,1])*100, digits=2),
                "%\nModel Logistic out-of-sample AUC=",round(auc(validation$default,p_out)*100, digits=2),"%",
                "%\nModel Logistic 2 out-of-sample AUC=",round(auc(validation$default,prob_default2)*100, digits=2),"%"))+
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")

validation$prob_default = pred
validation %>% 
  mutate(expected_return = -0.7 * prob_default + ret * (1-prob_default),
         actual_return = ifelse(default==1, -0.7, ret)) %>% 
  filter(expected_return>0) %>% 
  slice_max(order_by = expected_return, n = 800) %>% 
  select(actual_return) %>% 
  sum()
## [1] 161.2087

The Logistic Lasso model can achieve an AUC of ROC at 69.9%. The profit is $161.2087 and rate of return is 20.15%)

While the best logistic regression model I have explored can achieve a rate of return of 19.83%, profit at $158.67. AUC score at 68.46 and AIC at 17689 on validation set.

Logistic 2 only generates 19.36% ($154.87) with AUC at 68.46, AIC at 17701.

# there are 14760 coefficients.
summary(model_lasso$finalModel)
##             Length Class      Mode     
## a0             82  -none-     numeric  
## beta        14760  dgCMatrix  S4       
## df             82  -none-     numeric  
## dim             2  -none-     numeric  
## lambda         82  -none-     numeric  
## dev.ratio      82  -none-     numeric  
## nulldev         1  -none-     numeric  
## npasses         1  -none-     numeric  
## jerr            1  -none-     numeric  
## offset          1  -none-     logical  
## classnames      2  -none-     character
## call            5  -none-     call     
## nobs            1  -none-     numeric  
## lambdaOpt       1  -none-     numeric  
## xNames        180  -none-     character
## problemType     1  -none-     character
## tuneValue       2  data.frame list     
## obsLevels       2  -none-     character
## param           1  -none-     list

Section 14: For this question you will not need to use the Lending Club dataset. Suppose you are helping a government authority to decide what type of Covid-19 rapid virus detection tests to use across the country for nursing home residents and for daily wage construction site workers, for regular periodic testing. Nursing home residents are elderly retired individuals and often have ailments like diabetes or asthma which result in worse illness and higher risk of death, if they get Covid-19. On the other hand, daily wage construction workers have an average age of 35 and are relatively healthy. They are often the only earning member in their household, and they typically do not have savings that they can draw from in times of need. Three rapid tests are available, which vary in their sensitivity and specificity. Test A has high specificity and low sensitivity while test B has low specificity and high sensitivity. Test C has medium specificity and medium sensitivity. You need to pick a single test for use in nursing homes, and a single (possibly the same) test for use at construction sites. The tests will be offered free of cost. Which test(s) would you pick, and why? State any assumptions.

You have been asked to assess whether, for select individuals in each subpopulation (nursing homes and construction sites), it would be better if they could be given a different test from the one you recommended above for their subpopulation. If you could gather additional features related to the individuals in these two populations to support your argument, list 3 features you would gather, and why. State any assumptions.

Nursing homes: Type B. for elderly residents, we have to identify as most positive cases as we can, even make some mistakes. Or any delays will maybe lead to loss of life. Thus I would choose Type B with higher sensitivity, which enables us to identify higher percentage positive of all positives. Every percent higher could mean that a life will be saved.

Construction sites: Type C. for construction workers, they are relatively young and strong. They earn money every day when they are working, so it’s not appropriate to stop them from working and send them to hospital and even quarantine because of false positive test results. That will even make their families struggle. In the other way, we also cannot tolerate positive cases hanging around for too long with other construction workers and result in a wider range of pandemic. Thus I would choose Type C with both medium sensitivity and specificity, so most of the workers can pass the test to feel safe without too much danger of letting positive cases go.

Additional Features: - Age: age is an important measure to know the general health status of a person. There could be younger workers or residents who would like to pass the tests faster. They should have the choice to use the test more easily and maybe more frequently to make themselves feel safe. - Diabetes or asthma records: as stated, these problems will result in higher risk of death. Thus, even though they are young and strong, they still want to know whether they get Covid-19 more accurately to prevent themselves from entering ICU. - Number of siblings: Normally it’s difficult to find people’s exact personal savings and income. They may lie on that in survey. But government can easily access the number of siblings. Generally speaking, for young workers if they have siblings their family may be able to afford more risks.

Section 15: Investment competition

For this question, investment strategy is an important factor. I still think it will be less risky to invest equally in the loans rather than feed the money as much as the borrowers want. Once some big amount loans default, the financial crisis will be close. I am not sure about the evaluation process of this question, but for what I choose after Question 10, they are all equally invested. (Actually, others may have already invested some money and made the investment you can make less than loan amount in the data set.) This is also the best practice recommended by Lending Club ($25 in each) and closer to reality (Sometimes we do not have that big amount of money to satisfy 200 loans for no matter how much they want). We just equally invest our budget, make the investment as dispersed as we can, and gain the average return.

lc_assessment<- read_csv("Assessment Data_2021.csv") %>%  #load the data 
  clean_names() %>% # use janitor::clean_names() 
  mutate(
    issue_d = mdy(issue_d),  # lubridate::mdy() to fix date format
    term = factor(term_months),     # turn 'term' into a categorical variable
    delinq_2yrs = factor(delinq_2yrs)) # turn 'delinq_2yrs' into a categorical variable

p_assessment <- predict(model_lasso, lc_assessment, type= "prob")
# assign the predicted probability
lc_assessment$prob_default = p_assessment

loan_numbers = lc_assessment %>%
  mutate(ret= installment * term_months/loan_amnt - 1,
         expected_return = -0.7 * prob_default + ret * (1-prob_default)) %>% 
  filter(expected_return>0) %>% # I don't set a cutoff threshold for probability, but for expected return. Once it's positive, it's worth a try.
  slice_max(order_by = expected_return, n=200) %>% 
  select(loan_number)
results = data.frame(loan_number = 1:1800) %>% 
  mutate(yuan_gao = ifelse(loan_number %in% loan_numbers$loan_number, "1", "0"))
write_csv(results,"yuan_gao.csv")

6 Critique

No data science engagement is perfect.

Section 16: A critique

The model considers mostly returns but not risks. There could be many sources of risks in the market of different areas like influential political or social events, economic crisis and natural catastrophes within the 36 or 60 months length. The model does not take those possibilities into account and guides investors to give their money out. Learning and considering the variances (beta) of returns may help to minimize potential risks caused by market fluctuates and even big events.

Section 17. Privacy information

It’s immoral and sometimes even illegal to use information of clients like gender and race to make a decision whether to lend them money. Here we believe people of different genders and races have same probability of default. People default because of some special circumstances instead of their gender or race. We should not consider such kind of information.